Assignment Instructions:

Assignment submission (YOUR NAME): Vedika Shirtekar


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(interactions) 
library(ggridges)

Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals!


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is ceteris paribus or whether selection bias is likely (be specific!).

The control sites (Arroyo Quemado, Carpenteria, and Mohawk) are reasonably located along the same coastline as the treatment sites (Naples and Isla Vista), helping to control for larger-scale environment differences such as water temperature, wave intensity, and additional environmental factors. However, there are potential sources of selection bias, as the treatment sites were chosen for MPA protection and could potentially be correlated with preexisting ecological conditions (ex. high lobster densities or improved habitat quality) that differ from the control sites (non-MPAs). While the proximity of the control sites can provide the ceteris paribus comparison for regional environmental conditions, influence from human activity or differences in reef structure could influence spiny lobster counts independent of MPA status.


Step 2: Read & wrangle data

a. Read in the raw data from the “data” folder named spiny_abundance_sb_18.csv. Name the data.frame rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)
rawdata <- 
    read_csv(here::here("data", "spiny_abundance_sb_18.csv"), na = "-99999") %>% 
    clean_names()

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
tidydata <- rawdata %>% 
  mutate(
    reef = factor(
      case_when(
        site == "AQUE" ~ "Arroyo Quemado", 
        site == "CARP" ~ "Carpenteria", 
        site == "MOHK" ~ "Mohawk", 
        site == "IVEE" ~ "Isla Vista", 
        site == "NAPL" ~ "Naples"
      ),
      levels = c(
        "Arroyo Quemado",
        "Carpenteria",
        "Mohawk",
        "Isla Vista",
        "Naples"
      )
    )
  )

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

  • Create a variable mean_size from the variable size_mm
  • NOTE: The variable counts should have values which are integers (whole numbers).
  • Make sure to account for missing cases (na)!

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 
counts <- tidydata %>% 
 # mutate(size_mm = na_if(size_mm, -99999)) %>% 
  group_by(site, year, transect, reef) %>% 
  summarize(
    count = sum(count, na.rm = TRUE),
    mean_size = mean(size_mm, na.rm = TRUE),
    .groups = "drop"
  )

#HINT(e): Use `case_when()` to create the 3 new variable columns
spiny_counts <- counts %>% 
  mutate( 
    mpa = 
      case_when(
        site %in% c("NAPL", "IVEE") ~ "MPA",
        site %in% c("AQUE", "CARP", "MOHK") ~ "non_MPA"), 
    treat = 
      case_when(
        mpa == "MPA" ~ 1, 
        mpa == "non_MPA" ~ 0
        
      )
    ) 

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

## tidydata
str(tidydata)
## tibble [4,362 × 11] (S3: tbl_df/tbl/data.frame)
##  $ year     : num [1:4362] 2012 2012 2012 2012 2012 ...
##  $ month    : num [1:4362] 8 8 8 8 8 8 8 8 8 8 ...
##  $ date     : Date[1:4362], format: "2012-08-20" "2012-08-20" ...
##  $ site     : chr [1:4362] "IVEE" "IVEE" "IVEE" "IVEE" ...
##  $ transect : num [1:4362] 1 1 1 1 2 2 2 2 3 3 ...
##  $ replicate: chr [1:4362] "A" "B" "C" "D" ...
##  $ size_mm  : num [1:4362] NA NA NA NA NA NA NA NA 70 60 ...
##  $ count    : num [1:4362] 0 0 0 0 0 0 0 0 1 1 ...
##  $ num_ao   : num [1:4362] 0 0 0 0 0 0 0 0 0 0 ...
##  $ area     : num [1:4362] 300 300 300 300 300 300 300 300 300 300 ...
##  $ reef     : Factor w/ 5 levels "Arroyo Quemado",..: 4 4 4 4 4 4 4 4 4 4 ...
## spiny_counts
str(spiny_counts)
## tibble [252 × 8] (S3: tbl_df/tbl/data.frame)
##  $ site     : chr [1:252] "AQUE" "AQUE" "AQUE" "AQUE" ...
##  $ year     : num [1:252] 2012 2012 2012 2012 2012 ...
##  $ transect : num [1:252] 1 2 3 4 5 6 7 1 2 3 ...
##  $ reef     : Factor w/ 5 levels "Arroyo Quemado",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ count    : num [1:252] 5 9 0 9 11 0 4 7 2 0 ...
##  $ mean_size: num [1:252] 64.2 66 NaN 74.1 76.9 ...
##  $ mpa      : chr [1:252] "non_MPA" "non_MPA" "non_MPA" "non_MPA" ...
##  $ treat    : num [1:252] 0 0 0 0 0 0 0 0 0 0 ...

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# plot 1: Distribution of Spiny Lobster Counts by Reef Site
site_medians <- spiny_counts %>%
  group_by(site) %>% 
  summarize(median_count = median(count), .groups = "drop")


spiny_counts %>%
  ggplot(aes(x = count, fill = site)) +
  geom_histogram(bins = 35, color = "white", alpha = 0.6) +
  geom_vline(
    data = site_medians,
    aes(xintercept = median_count),
    color = "darkblue",
    linetype = "dashed",
    linewidth = .5 
  ) +
    scale_fill_brewer(palette = "Set1") + 

  facet_wrap(~site, scales = "free_y") +
  theme_bw(base_size = 10) +
  theme(legend.position = "none") +
  labs(
    x = "Lobster Counts",
    y = "Frequency",
    title = "Distribution of Spiny Lobster Counts by Reef Site",
    subtitle = "Dashed line indicates median lobster abundance per site"
  )

# plot 2.. Distribution of Spiny Lobster Counts by Treatment (MPA Designation)
spiny_counts %>%
    mutate(treat = factor(treat, levels = c(0, 1), labels = c("Non-MPA", "MPA"))) %>%
  ggplot(aes(x = factor(treat), y = count, fill = treat)) +
  geom_violin(trim = FALSE, draw_quantiles = c(0.25, 0.5, 0.75), alpha = 0.7) +
  geom_jitter(width = 0.15, size = 1, alpha = 0.5, color = "black") +
  scale_fill_manual(values = c("Non-MPA" = "#FB6A4A", "MPA" = "#6BAED6")) +
  theme_bw(base_size = 10) +
  labs(
    x = "Treatment",
    y = "Lobster Counts",
    fill = "MPA Designation",
    title = "Distribution of Spiny Lobster Counts by Treatment (MPA Designation)",
    subtitle = "Lines reference quartiles (25%, 50%, 75%)"
  ) +
  theme(legend.position = "top")

# plot 3... Distribution of Spiny Lobster Counts (2012-2018)
 spiny_counts %>%
  ggplot(
    aes(
      x = count,
      y = fct_rev(factor(year)),
      fill = stat(quantile)
    )
  ) +
  geom_density_ridges_gradient(
    quantiles = c(0.25, 0.5, 0.75),
    quantile_lines = T,
    scale = 1,
    alpha = 0.3
  ) + scale_fill_viridis_d(option = "J", alpha = .6)+ 
  theme_bw(base_size = 10) +
  labs(
    title = "Distribution of Spiny Lobster Counts (2012-2018)",
    subtitle = "IQR (25-75%) outlined",
    x = "Lobster Counts",
    y = "Year", 
    fill = "Quantile"
    
  )

# plot 4... Mean lobster size by reef with highlighted average of reef mean sizes
beeswarm::beeswarm(
  mean_size ~ factor(reef), 
  pch = 19, 
  data = spiny_counts,
  col = 2:6, 
  cex.axis = 0.8,
  main = "Distribution of Mean Lobster Sizes by Reef \n(Diamonds represent reef-level means)",
  xlab = "Reef",
  ylab = "Mean Lobster Size (mm)"
)

reef_means <- spiny_counts %>% 
  group_by(reef) %>% 
  summarize(average_mean_dist = mean(mean_size, na.rm = TRUE))

# Add on point to indicate reef-level mean (contrary to ggplot, need to add manually)
points( # Convert to numeric because points() expects numeric coordinates
  x = as.numeric(factor(reef_means$reef)),
  y = reef_means$average_mean_dist,
  pch = 23,  
  bg = "black"
)

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()
spiny_counts %>% 
    tbl_summary(by = mpa, include = count, 
                             statistic = list(all_continuous() ~ "{mean} ({sd})"),
                             missing = "no")  %>% add_p()
Characteristic MPA
N = 119
1
non_MPA
N = 133
1
p-value2
count 28 (44) 23 (39) 0.3
1 Mean (SD)
2 Wilcoxon rank sum test

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

There are about 23 spiny lobsters predicted, on average, in non-MPA areas. MPA areas have, on average, about 5 more spiny lobsters than non-MPA areas. Marine protected areas are likely associated with higher lobster abundance; there is evidence to suggest that the influence of MPA designation on the predicted spiny lobster count in a measured reef is unlikely due to random chance alone (p < .05).

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(count ~ treat, data = spiny_counts)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable count
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

  • qqplot: The residuals do not fall along the line and show a pattern (curve), indicating that the relationship between the treatment group and the lobster count may not be linear.

  • Normality of Residuals: The distribution of the residuals is non-normal, indicating that the model may not capture underlying patterns between variables or potential outliers in the spiny lobster count.

  • Homogeneity of Variance: There is clear heteroscedasticity in the homogeneity of variance graph, indicating that the spread or variance of the spiny lobster count is not equal across the treatment groups (MPA or non-MPA).

  • Posterior Predictive Check: The posterior predictive check indicates a poor model fit, as the simulated outcomes diverge from the observed data. This suggests that the relationship between the treatment and spiny lobster count is not sufficiently captured by a linear regression model; another model, such as a Poisson regression for count data, may be a better alternative.

check_model(m1_ols,  check = "qq" )

check_model(m1_ols, check = "normality")

check_model(m1_ols, check = "homogeneity")

check_model(m1_ols, check = "pp_check")


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

Spiny lobster counts are significantly higher in MPA designated areas compared to non-MPA sites (p < .05). MPAs are associated with a 24% higher average spiny lobster count compared to non-MPA sites.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

In a Poisson regression, dispersion refers to the equivalence of the mean and variation in the response/outcome variable, such as the spiny lobster count in this model. With overdispersion, the variance in the outcome variable (ie. spiny lobster counts) exceeds the mean, indicating that the model could underestimate the standard errors from the model. Overdispersion in this model could indicate variation in spiny lobster counts due to under-represented or unaccounted factors, such as habitat differences within sites.

d. Compare results with previous model, explain change in the significance of the treatment effect.

Compared to the previous linear model, the treatment effect in the Poisson model is now statistically significant. This change occurs because the Poisson model appropriately accounts for the discrete count data and the mean-variance relationship for counts. The linear model previously assumed constant variance and a continuous response; by modeling the data on the log scale with the Poisson distribution, the model better captures the underlying distribution of spiny lobster counts. As a result, the Poisson model reveals a significant positive effect of treatment (MPA) on lobster abundance compared to non-MPA transects.

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 


#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

m2_pois <- glm(count ~ treat, data = spiny_counts, family = poisson(link = "log"))


summ(m2_pois, exp = T) # exp will return the IRR (exponentation of beta)
Observations 252
Dependent variable count
Type Generalized linear model
Family poisson
Link log
𝛘²(1) 71.36
p 0.00
Pseudo-R² (Cragg-Uhler) 0.25
Pseudo-R² (McFadden) 0.01
AIC 11365.62
BIC 11372.68
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 22.73 21.93 23.55 171.74 0.00
treat 1.24 1.18 1.30 8.44 0.00
Standard errors: MLE
# Calculate percent change: ((New IRR - Old IRR) / Old IRR) * 100, where "1" refers to the treatment group
print(paste("The percent change in the predictor is", ((1.24-1)/1)*100, "%."))
## [1] "The percent change in the predictor is 24 %."

e. Check the model assumptions. Explain results.

  • Posterior Predictive Check: The model’s predicted intervals slightly correspond to the observed data; however, the simulated intervals still do not accurately fit or capture the range of the observed data, indicating a poor fit likely due to overdispersion or zero-inflation.

  • Misdirected Dispersion and Zero-Inflation: The observed residual variance is higher than the predicted residual variance, especially at larger predicted means. As such, the model may underestimate the variance in the count outcome and not account for zero-inflation (many zeroes) or overdispersion.

  • Homogeneity of Variation: The plot shows that many residuals are spread unevenly compared to the reference line. As a result, constance variance may be violated; residuals appear to be heteroscedastic.

  • Influential Observation: All points appear to be near the boundary of the contour with few extreme residuals; some observations have a higher influence which could potentially affect the parameter estimates in the model.

  • Distribution of Quantile Residuals: There is a clear deviation away from the line in the qqplot. As a result, the residuals are not normally distributed.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

The Poisson distribution assumes that the mean and variance are equal. However, the variance is much larger than the mean and the p value is very significant, indicating strong overdispersion (the model is underestimating the variability in the data). Although the model predicts no zeros, the data indicates 27 zeros. This indicates the model is under fitting zeros.

check_model(m2_pois)

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001
check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model.

Because the Poisson model showed zero inflation and overdispersion, a negative binomial GLM is appropriate because it introduces an extra dispersion parameter to account for variance exceeding the mean and can better handle excess zeros (ie. like in the observed data), providing more accurate estimates and inference. This approach prevents overestimating the treatment effect’s significance in the Poisson model.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

In the Poisson model, the treatment effect was statistically significant (p < 0.05), indicating a 24% increase in spiny lobster counts in transects within MPAs compared to non-MPAs. However, this model likely overestimated significance due to zero inflation and overdispersion. The negative binomial model also estimated a 24% increase, but the effect was not statistically significant (p = 0.22). This model provides a better fit, as the posterior predictive checks show simulations closely match the observed data and nearly all residuals fall along the line in the qqplot. Although the negative binomial model accounts for overdispersion, some zero inflation remains, which may influence the observed significance of the treatment effect.

library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
# NOTE: The `glm.nb()` function does not require a `family` argument

m3_nb <- glm.nb(count ~ treat, data = spiny_counts) 

summ(m3_nb, exp = T)
Observations 252
Dependent variable count
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
𝛘²(250) 1.52
p 0.22
Pseudo-R² (Cragg-Uhler) 0.01
Pseudo-R² (McFadden) 0.00
AIC 2088.53
BIC 2099.12
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 22.73 18.02 28.66 26.40 0.00
treat 1.24 0.88 1.73 1.23 0.22
Standard errors: MLE
# Calculate percent change: ((New IRR - Old IRR) / Old IRR) * 100, where "1" refers to the treatment group
print(paste("The percent change in the predictor is", ((1.24-1)/1)*100, "%."))
## [1] "The percent change in the predictor is 24 %."
check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088
check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb)

check_model(m3_nb)


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

Across all three models, the estimated treatment effect remains rather consistent (stable), corresponding to a roughly 24% increase in spiny lobster counts within MPAs compared to non-MPAs based on the calculated percent change of the treatment effect in each model. While the Poisson model indicates a statistically significant effect, the OLS and negative binomial models show that the effect is not significant even if overdispersion is accounted for (the negative binomial model still experiences zero-inflation). This indicates that the size of the treatment effect appears consistent across models, but whether it appears statistically significant depends on which model is used and how the data are distributed.

export_summs( # ADD MODELS
        m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.
# Calculate percent change
(5.36/22.73) *100  # Linear regression
## [1] 23.58117
(exp(.21)-1/1)*100 # Poisson regression
## [1] 23.36781
(exp(.21)-1/1)*100 # Negative binomial regression
## [1] 23.36781

Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following negative binomial model using glm.nb()

  • Add fixed effects for year (i.e., dummy coefficients)
  • Include an interaction term between variables treat & year (treat*year)

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

The negative binomial model is estimating how spiny lobster counts vary across years, between treatment groups, and in the interaction between year and treatment. Essentially, it is comparing the expected counts for different sub groups (ex. MPA transects vs non-MPA transects each year) while also accounting for overdispersion in the outcome. The model captures both overall trends across years and how the effect of treatment changes over time.

d. Explain why the main effect for treatment is negative? *Does this result make sense?

The main effect for treatment is negative because it represents the baseline difference between the treatment (transects in an MPA) and control group (transects in non-MPAs) when all years are set to zero (ie. the reference year). This does not mean that the treatment reduces the spiny lobster counts overall but indicates that in the reference year, the MPA transects had a lower average count of spiny lobsters than non-MPA transects. The positive “treatment by year” interaction terms show that counts in MPAs increase over time, indicating consistency with expected benefits of protection provided by MPAs.

ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- glm.nb(
    count ~ 
        treat +
        year +
        treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable count
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (d) above.

interact_plot(m5_fixedeffs, pred = year, modx = treat, 
              outcome.scale = "link") # NOTE: y-axis on log-scale

# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts
interact_plot(m5_fixedeffs, pred = year, modx = treat, 
              outcome.scale = "response") 

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- spiny_counts %>%
  group_by(year, mpa) %>%
  summarize(mean_count = mean(count, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = factor(year), y = mean_count, color = factor(mpa), group = mpa)) +
  geom_point(size = 3) +
  geom_line(size = 1) +
  theme_bw() +
  labs(
    x = "Year",
    y = "Mean Lobster Count",
    color = "MPA Status", 
    title = "Average Spiny Lobster Counts Overtime in MPA and Non-MPA transects"
  ) +
  scale_color_manual(values = c("blue", "lightblue"))

plot_counts


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

    Spillover effects are possible in this study because lobsters in MPAs (NAPL, IVEE) could potentially migrate to nearby non-MPA reefs (AQUE, CARP, MOHK). This movement could violate the SUTVA principle because the designation of one reef as an MPA could influence outcomes at nearby control (non-MPA) sites. If spillover occurs, then it would likely reduce the observed difference between the MPAs and non-MPAs (control sites), making it difficult to detect a causal effect of protection. While this effect may be ecologically desirable due to potentially increases lobster counts outside of MPAs, it would be difficult to precisely measure the treatment effect of MPA designation on spiny lobster counts from a research perspective.

  2. Explain why spillover is an issue for the identification of causal effects.

    Spillover is an issue for identifying causal effects because it violates the SUTVA assumption. The SUTVA assumption assumes that each unit’s outcome depends on only its own treatment. Spillover could occur if lobsters move from MPAs to nearby non-MPAs. As such, spillover can potentially mask longer term treatment effects because the difference in effect between MPA and non-MPA is reduced, making it difficult to isolate for the effect of protection in MPAs on average spiny lobster count.

  3. How does spillover relate to impact in this research setting?

    If the SUTVA assumption is violated by movement of lobsters between MPAs and non-MPAs, then it becomes more difficult to measure the causal effects since the difference between the MPAs and non-MPAs effect is reduced. Essentially, lobsters moving from MPAs into nearby non-MPAs could increase lobster abundance and decrease the observable treatment-control difference that conservationists and researchers rely on for measuring the impact of protection on increase species density.

  4. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption: This assumption assumes that each unit’s outcome depends only on its own treatment and not on the treatment assigned to other units. SUTVA could be violated due to spillover effects. For instance, lobsters may move between an MPA and non-MPA and populate outside MPAs during their reproductive stage to explore varied food items. As a result, the non-MPA lobster counts are partially affected by neighboring MPA sites, reducing the difference between the treatment (MPA) and control (non-MPA) outcomes. So, this assumption if likely unreasonable unless short-term or isolated studies can potentially accommodate the SUTVA assumption if spillover is minimal.
    2. Exogeneity assumption: Based on the exogeneity assumption, the MPA designation should be unrelated to unobserved (environmental or anthropogenic) factors; the MPA designation status should not be correlated with the regression term. The assumption could be violated if MPAs are placed in areas with characteristics that also influence lobster outcomes, such as greater habitat quality or lower historical fishing pressure. These factors could independently affect lobster abundance and confound the estimated treatment effect. This assumption is probably not reasonable as MPAs are likely to not be randomly assigned; they may be assigned based on specific environmental criteria instead.

EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (extracredit_sblobstrs24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)